@somratdutta (Contributor) commented Jul 28, 2025

Summary

Checklist

@somratdutta somratdutta requested review from a team as code owners July 28, 2025 20:44
vercel bot commented Jul 28, 2025

@somratdutta is attempting to deploy a commit to the ClickHouse Team on Vercel.

A member of the Team first needs to authorize it.

@somratdutta (Contributor, Author) commented

Testing Instructions

This PR depends on a recently merged ClickHouse fix that has not yet shipped in a Docker image. Below are instructions for validating the changes locally against a downloaded binary, using Nessie as the REST catalog backend.

Prerequisites

Download the appropriate ClickHouse binary from the build artifacts based on your platform. For macOS on Apple Silicon, use the arm_darwin build.

Environment Setup

  1. Initialize ClickHouse Server

    chmod +x clickhouse
    ./clickhouse server
  2. Deploy Supporting Infrastructure

    Create a docker-compose.yaml file with the following configuration (a start-up and smoke-test sketch follows this list):

    services:
      jupyter:
        image: quay.io/jupyter/pyspark-notebook:2024-10-14
        depends_on:
          minio:
            condition: service_healthy
        command: start-notebook.sh --NotebookApp.token=''
        volumes:
          - ./notebooks:/home/jovyan/examples/
        ports:
          - "8888:8888"
    
      nessie:
        image: ghcr.io/projectnessie/nessie:latest
        ports:
          - "19120:19120"
        environment:
          - nessie.version.store.type=IN_MEMORY
          - nessie.catalog.default-warehouse=warehouse
          - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
          - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
          - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
          - nessie.catalog.service.s3.default-options.path-style-access=true
          - nessie.catalog.service.s3.default-options.auth-type=STATIC
          - nessie.catalog.secrets.access-key.name=admin
          - nessie.catalog.secrets.access-key.secret=password
          - nessie.catalog.service.s3.default-options.region=us-east-1
          - nessie.server.authentication.enabled=false
    
      minio:
        image: quay.io/minio/minio
        ports:
          - "9002:9000"
          - "9003:9001"
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
          - MINIO_REGION=us-east-1
        healthcheck:
          test: ["CMD", "mc", "ready", "local"]
          interval: 5s
          timeout: 10s
          retries: 5
          start_period: 30s
        entrypoint: >
          /bin/sh -c "
          minio server /data --console-address ':9001' &
          sleep 10;
          mc alias set myminio http://localhost:9000 admin password;
          mc mb myminio/my-bucket --ignore-existing;
          tail -f /dev/null"
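
A minimal sketch for bringing the stack up and smoke-testing it, assuming Docker Compose v2 and that Nessie serves the standard Iceberg REST /v1/config endpoint under its /iceberg prefix:

# Start Jupyter, Nessie, and MinIO in the background
docker compose up -d

# Nessie's Iceberg REST endpoint should answer once the container is up
curl -s http://localhost:19120/iceberg/v1/config

# The MinIO console was remapped to host port 9003 in the compose file;
# a 200 here means the service came up and the bucket entrypoint ran
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9003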

Data Ingestion via PySpark

Create the notebook notebooks/PySpark-Nessie.ipynb to seed test data using PySpark with Nessie and Apache Iceberg:

from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
    .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
    .config("spark.sql.catalog.nessie.type", "rest")
    .getOrCreate()
)

# Create a namespace in Nessie
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo").show()

# Create a table in the `nessie.demo` namespace using Iceberg
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS nessie.demo.sample_table (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
).show()

# Insert data into the sample_table
spark.sql(
    """
    INSERT INTO nessie.demo.sample_table VALUES
    (1, 'Alice'),
    (2, 'Bob')
    """
).show()

# Query the data from the table
spark.sql("SELECT * FROM nessie.demo.sample_table").show()

# Stop the Spark session
spark.stop()
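
Before moving on to ClickHouse, it can help to confirm that the notebook actually committed the table. A minimal sketch, assuming Nessie's native REST API is exposed at /api/v2 and reusing the myminio alias created in the MinIO entrypoint above:

# List the catalog entries committed on the main branch via Nessie's REST API;
# the demo namespace and demo.sample_table should appear
curl -s http://localhost:19120/api/v2/trees/main/entries

# Inspect the Iceberg data and metadata files the notebook wrote to MinIO
# (the exact object layout under my-bucket is Iceberg's, not guaranteed here)
docker compose exec minio mc ls --recursive myminio/my-bucket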

Integration Testing

After executing the notebook, connect to ClickHouse and validate the DataLakeCatalog integration with Nessie:

./clickhouse client

Execute the following SQL commands to verify functionality:

-- Enable experimental Iceberg support
SET allow_experimental_database_iceberg = 1;

-- Configure DataLakeCatalog with Nessie REST catalog backend
CREATE DATABASE demo 
ENGINE = DataLakeCatalog('http://localhost:19120/iceberg', 'admin', 'password') 
SETTINGS 
    catalog_type = 'rest', 
    storage_endpoint = 'http://localhost:9002/my-bucket', 
    warehouse = 'warehouse';

-- Verify table discovery
SHOW TABLES FROM demo;

-- Validate data retrieval
SELECT * FROM demo.`demo.sample_table`;
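
For repeat runs, the same checks can be scripted with clickhouse client instead of the interactive shell. A minimal sketch, assuming allow_experimental_database_iceberg can be passed as a command-line setting (each invocation is a fresh session, so the SET statement above does not carry over):

# Pass the experimental flag per invocation; the database itself persists
./clickhouse client --allow_experimental_database_iceberg=1 \
    --query "SHOW TABLES FROM demo"

# Backticks are escaped so the shell does not treat them as command substitution
./clickhouse client --allow_experimental_database_iceberg=1 \
    --query "SELECT * FROM demo.\`demo.sample_table\`"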

Expected Results

The integration should successfully:

  • Discover the Iceberg table: demo.sample_table
  • Query the table returning the test dataset:
    ┌─id─┬─name──┐
    │  2 │ Bob   │
    │  1 │ Alice │
    └────┴───────┘
    

@somratdutta (Contributor, Author) commented

Hi @Blargian, can you take a look at this?
The dependent PR is merged now.

@Blargian merged commit 4f672cf into ClickHouse:main on Aug 26, 2025 (10 of 14 checks passed).